Bioinformatics A Practical Guide to Next Generation Sequencing Data Analysis (Hamid D. Ismail)


an input and generates quality assessment reports including per base sequence quality, per tile sequence quality, per sequence quality scores, per base sequence content, per sequence GC content, per base N content, sequence length distribution, sequence duplication levels, overrepresented sequences, adaptor content, and k-mer content. FastQC supports all variants of the FASTQ format as well as gzip-compressed FASTQ files.
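To make one of these report items concrete, the following is a minimal sketch of what "per sequence GC content" measures. It is only an illustration and not how FastQC itself is implemented. The function name gc_per_read is hypothetical, and the sketch assumes an uncompressed FASTQ file with exactly four lines per record.

```shell
# Hypothetical helper: print the GC percentage of each read in an
# uncompressed FASTQ file. Assumes four lines per record, with the
# sequence on the second line of each record.
gc_per_read() {
    awk 'NR % 4 == 2 {
        seq = toupper($0)
        len = length(seq)
        gc = gsub(/[GC]/, "", seq)   # gsub returns the number of G/C bases removed
        if (len > 0) printf "%.1f\n", 100 * gc / len
    }' "$1"
}
```

Calling gc_per_read on a FASTQ file prints one percentage per read; FastQC summarizes the same kind of per-read values as a distribution plot.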

We will download some public single-end FASTQ files from the NCBI BioProject with the accession “PRJNA176149” for practice. The SRA files of this project contain genomic single-end reads of Escherichia coli str. K-12. To keep the files organized, we can create the directory “ecoli” using “mkdir ecoli”, move into it using “cd ecoli”, and save the following IDs (one per line) in a text file named “ids.txt” using any text editor:

SRR653520

SRR653521

SRR576933

SRR576934

SRR576935

SRR576936

SRR576937

SRR576938
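As an alternative to a text editor, the same setup can be sketched entirely on the command line. This is one possible approach, assuming a POSIX-style shell; the here-document simply writes the eight IDs listed above into "ids.txt".

```shell
# Create the working directory (no error if it already exists) and move into it.
mkdir -p ecoli
cd ecoli

# Write the eight run accessions to ids.txt, one per line.
cat > ids.txt <<'EOF'
SRR653520
SRR653521
SRR576933
SRR576934
SRR576935
SRR576936
SRR576937
SRR576938
EOF

# Print the number of lines written as a quick check.
wc -l < ids.txt
```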

Then, run the following script to create the subdirectory “fastQC” and to download the FASTQ files associated with the IDs stored in the “ids.txt” file into that subdirectory:

mkdir fastQC
# Read one run accession per line from ids.txt and download its FASTQ file.
while read -r f; do
    fasterq-dump \
        --outdir fastQC "$f" \
        --progress \
        --threads 4
done < ids.txt
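Because the loop passes every line of "ids.txt" to the downloader, a stray character or a Windows line ending in that file can make a download fail. The following is a minimal sketch of a pre-download check. The function name check_ids is hypothetical, and the pattern assumes run accessions of the form SRR, ERR, or DRR followed by digits.

```shell
# Hypothetical helper: list lines of an ID file that do not look like
# run accessions (SRR/ERR/DRR followed by digits). Blank lines are accepted.
check_ids() {
    if [ ! -r "$1" ]; then
        echo "Cannot read $1" >&2
        return 2
    fi
    # -v selects lines that do NOT match, -n prefixes each with its line number.
    if grep -vnE '^([SED]RR[0-9]+)?$' "$1"; then
        echo "Fix the lines listed above before downloading." >&2
        return 1
    fi
    echo "All IDs in $1 look like run accessions."
}
```

Running "check_ids ids.txt" before the download loop prints any suspicious lines with their line numbers and returns a non-zero status if it finds one.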

Once the raw FASTQ files have been downloaded, we can use the command “ls -lh fastQC” to display the file names as shown in Figure 1.10.

FIGURE 1.10 The names of the downloaded FASTQ files.
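Beyond listing the file names, a quick sanity check is to count the reads in each downloaded file. The following is a minimal sketch; the function name count_reads is hypothetical, and it assumes the files are uncompressed, end in ".fastq", and use the standard four-lines-per-record layout.

```shell
# Hypothetical helper: print the number of reads in an uncompressed FASTQ
# file, assuming four lines per record.
count_reads() {
    echo $(( $(wc -l < "$1") / 4 ))
}

# Report the read count for every FASTQ file in the download directory.
for fq in fastQC/*.fastq; do
    [ -e "$fq" ] || continue   # skip if the pattern matched no files
    printf '%s\t%s\n' "$fq" "$(count_reads "$fq")"
done
```

A file whose line count is not a multiple of four is likely truncated or malformed and may be worth downloading again.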